We will be following on from the analysis we conducted in Workshop 5 (Distributions and Basic Statistics). We visually explored differences in the income levels between different groups of people in the US census. Now, we are going to conduct hypothesis testing to see if those differences are statistically significant.
# This tells python to draw the graphs "inline" - in the notebook
%matplotlib inline

import matplotlib.pyplot as plt
from scipy.stats import norm
import statistics
import seaborn as sns
import pylab
import pandas as pd
import numpy as np

# make the plots (graphs) a little wider by default
pylab.rcParams['figure.figsize'] = (10., 8.)
sns.set(font_scale=1.5)
sns.set_style("white")
df = pd.read_csv('./data/wk7/cps.csv')
df['race'] = df['race'].astype('category')
df['income'] = df['incwage']/1000 # express wage income in thousands of dollars
df = df[df['income'] < 200]
df = df[df['year'] == 2013] # filter the dataframe to only contain 2013 data
df.head()
    year  state  age  sex race   sch   ind  union  incwage  realhrwage                              occupation  income
20  2013     50   62    1    1  14.0  8090    1.0  57000.0   23.889143                                       .    57.0
32  2013     39   59    1    1  13.0  9590    0.0  62000.0   29.726475  Consruction, Extraction, Installation     62.0
34  2013     44   44    1    3  12.0  7290    0.0  45000.0   20.745834                                       .    45.0
36  2013     12   41    1    1  12.0  7070    1.0  28000.0   12.293828                                Managers    28.0
37  2013     33   35    1    1  12.0   770    0.0  42500.0   20.377020     Transportation and materials moving    42.5
This is once again the U.S. census data from Week 5. As a reminder, the raw dataframe has 11 columns (year, state, age, sex, race, sch, ind, union, incwage, realhrwage, and occupation), plus the income column (wage income in thousands of dollars) that we derived when loading the data.
So far in this workshop, we’ve had the luxury of being able to draw many random samples and plot the distributions of their sample means to infer the population mean. The Central Limit Theorem lets us assume that these sample means are normally distributed, and consequently that there is a 95.45% chance that the population mean lies within two standard errors of the sample mean. This allows us to make inferences on the basis of a single sample. The standard error is calculated from the sample standard deviation:
Sample Standard Deviation
\[\huge s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2}\]
Standard Error
\[\huge se = \frac{s}{\sqrt{n}}\]
Given a large enough sample \(x\), we can build a 95% confidence interval as follows:
\[ \huge 95\% CI = \overline{x} \pm (1.96 \times se)\]
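Before applying this to the census data, the arithmetic can be sketched on a small made-up array (the numbers here are purely illustrative):

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # small made-up sample
n = len(x)
s = x.std(ddof=1)                # sample standard deviation (n-1 in the denominator)
se = s / np.sqrt(n)              # standard error of the mean
lower = x.mean() - 1.96 * se     # lower bound of the 95% CI
upper = x.mean() + 1.96 * se     # upper bound of the 95% CI
print(round(lower, 2), round(upper, 2))
```

Note the `ddof=1` argument: numpy defaults to dividing by \(n\), whereas the sample standard deviation formula above divides by \(n-1\).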
Let’s draw a sample of 1000 random individuals from our data, and compute a 95% confidence interval to estimate the population mean for income. We’ll begin by creating a swarmplot to get a sense of how the data are distributed.
sample = df.sample(1000) # draw a random sample of 1000 observations from the dataframe
sns.swarmplot(data=sample, y='income') # plot a swarmplot of income
plt.title('Income Distribution') # add a title
Now let’s set about calculating the 95% confidence interval and plotting it on our swarmplot. Luckily, the components we need to do this are easy to calculate. We just need the mean, standard deviation, and number of observations. All of these are provided by the .describe() function, which calculates summary statistics for a sample.
desc = sample['income'].describe() # calculate descriptive statistics for the sample
print(desc) # print the descriptive statistics
count 1000.000000
mean 47.677568
std 33.041244
min 0.700000
25% 25.000000
50% 40.000000
75% 62.000000
max 190.000000
Name: income, dtype: float64
From the set of descriptive statistics, we can pull out the relevant components, calculate the standard error, and create a 95% confidence interval as follows:
mean = desc['mean'] # set the mean equal to a variable called "mean"
std = desc['std'] # set the standard deviation equal to a variable called "std"
n = desc['count'] # set the sample size equal to a variable called "n"
se = std/np.sqrt(n) # calculate the standard error of the mean
print('The mean is', round(mean, 2), 'with a standard error of', round(se, 2)) # print the mean and standard error
upper_ci = mean+se*1.96 # calculate the upper confidence interval
lower_ci = mean-se*1.96 # calculate the lower confidence interval
print('The 95% confidence interval is', round(lower_ci, 2), 'to', round(upper_ci, 2)) # print the confidence interval
The mean is 47.68 with a standard error of 1.04
The 95% confidence interval is 45.63 to 49.73
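As a sanity check, scipy can build the same interval for us. norm.interval uses the exact 97.5th-percentile z value (1.95996...) rather than the rounded 1.96, so the bounds agree to two decimal places; the summary statistics below are copied from the output above:

```python
import numpy as np
from scipy.stats import norm

mean, std, n = 47.677568, 33.041244, 1000  # values from desc above
se = std / np.sqrt(n)                      # standard error of the mean
lower, upper = norm.interval(0.95, loc=mean, scale=se)  # 95% CI around the mean
print(round(lower, 2), round(upper, 2))
```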
Finally, let’s plot these bounds on our swarmplot to graphically show this range. We can now claim that based on our sample, there is a 95% chance that the true population mean of income (shown in red) lies within this range.
sns.swarmplot(data=sample, y='income', alpha=0.5) # plot a swarmplot of income
plt.axhline(df['income'].mean(), color='red', linestyle='solid', linewidth=3, label='Population Mean') # add a horizontal red line at the population mean
plt.axhline(upper_ci, color='black', linestyle='dashed', linewidth=3) # add a dashed black line at the upper confidence interval
plt.axhline(lower_ci, color='black', linestyle='dashed', linewidth=3) # add a dashed black line at the lower confidence interval
plt.title('Income Distribution, 95% Confidence Interval') # add a title
If we create a boxplot of income disaggregated by sex using our sample, we can observe that men seem to earn more than women:
sns.boxplot(data=sample , x='sex', y='income').set_xticklabels(['Men','Women']) # make a box plot of income by sex
But is this difference statistically significant? It could just be due to sampling error, random chance. Hypothesis testing provides a framework through which we can formally evaluate the likelihood of encountering an effect as extreme as the one we observe in our data (in this case, the difference between the mean incomes of the two groups). There are four main steps in hypothesis testing:
State the hypotheses. H0 states that there is no difference between the two population means.
Select an \(\alpha\) level (e.g. 95% confidence), and select a corresponding critical value (1.96 for large samples)
Compute the test statistic.
Make a decision; if the test statistic exceeds the critical value, we reject the null hypothesis.
Steps 1, 2, and 4 remain fairly constant regardless of what kind of hypothesis testing you’re conducting. Step 3 can vary quite a bit, as there are many different statistical tests that fall under the umbrella of hypothesis testing. In today’s workshop we’ll be using the Student’s T-Test (more on that in a second). For now, let’s begin the process of hypothesis testing.
1. State the hypotheses
The Null Hypothesis
\(H_0\) : There is no difference in the mean income between men and women
2. Select an α Level and Locate the Critical Region
Locate the critical region; the critical values for the t statistic are obtained using degrees of freedom (\(df=n-2\)). Given that we have 1000 observations, \(df=998\). When \(df\) is this large, the t distribution is nearly identical to the normal distribution, so you can simply memorize the following critical values:
At the 95% confidence level, the critical value is 1.96
At the 99% confidence level, the critical value is 2.58
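Rather than memorizing them, the exact critical values for any degrees of freedom can be pulled from scipy’s t distribution; with df = 998 they are very close to the z values above:

```python
from scipy.stats import t

dof = 998  # degrees of freedom, n - 2
crit_95 = t.ppf(0.975, dof)  # two-tailed critical value at the 95% confidence level
crit_99 = t.ppf(0.995, dof)  # two-tailed critical value at the 99% confidence level
print(round(crit_95, 3), round(crit_99, 3))
```

Note that the quantiles are 0.975 and 0.995, not 0.95 and 0.99, because the test is two-tailed: the remaining probability is split between both tails.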
If our test statistic exceeds either of these values, we can reject the null hypothesis with the according level of confidence. The function below creates a plot which provides a visual reference for these values, but isn’t really necessary for the process of hypothesis testing. The function accepts one argument test_statistic, which it will use to plot a vertical red line. If the red line falls within the dotted lines, we fail to reject the null hypothesis at the corresponding confidence level. If it’s outside of these bounds, we reject the null hypothesis.
In the last line of code below, I’ve called the function to plot a test statistic of -2.3. Would we reject the null hypothesis at the 95% confidence level? What about the 99% level?
def plot_z(test_statistic):
    mu, se = 0, 1 # create two variables, a mean "mu" equal to zero, and standard deviation "se" equal to 1
    x = np.linspace(mu - 3*se, mu + 3*se, 100) # create a range of values from -3 to 3 standard deviations
    plt.plot(x, norm.pdf(x, mu, se)) # plot the normal distribution
    plt.axvline(mu-se*1.96, color='blue', linestyle='dashed', linewidth=1.5, label='µ ± 1.96σ (95% confidence)') # lower 95% critical value
    plt.axvline(mu+se*1.96, color='blue', linestyle='dashed', linewidth=1.5) # upper 95% critical value
    plt.axvline(mu-se*2.58, color='green', linestyle='dashed', linewidth=1.5, label='µ ± 2.58σ (99% confidence)') # lower 99% critical value
    plt.axvline(mu+se*2.58, color='green', linestyle='dashed', linewidth=1.5) # upper 99% critical value
    plt.axvline(test_statistic, color='red', linestyle='solid', linewidth=1.5, label='Test Statistic') # solid red line at the test statistic
    plt.ylim(0, 0.4)
    plt.legend()
    plt.title('Z Distribution') # add a title
    plt.show()

plot_z(-2.3)
3. Calculate the Test Statistic (The Student’s T-Test)
The Student’s T-Test is an independent-measures test, used in situations where a researcher has no prior knowledge about either of the two populations (or treatments) being compared. In particular, the population means and standard deviations are all unknown. Because the population variances are not known, these values must be estimated from the sample data.
The purpose of a T-test is to determine whether the sample mean difference indicates a real mean difference between the two populations or whether the obtained difference is simply the result of sampling error. Given two groups, \(x_1\) and \(x_2\), the \(t\) statistic is calculated as:
\[ \Huge t = {\frac{\overline{x_1}-\overline{x_2}} {\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}}} \]
Where:
\(\overline{x}\): Sample Mean
\(s^2\): Sample Variance (the squared sample standard deviation)
\(n\): Number of observations
We’ve already seen how to calculate each of these components when we made the 95% confidence interval above using the .describe() function. To calculate the t-statistic, we just have to plug these values into the formula above and do some basic arithmetic. I’ve put together a function that does this below, which accepts two main arguments, group1 and group2. For each group it calculates descriptive statistics, and uses these values to calculate the t-statistic. It also has an optional argument plot, which when set to True will plot a 95% confidence interval for each group. It defaults to False, meaning that it won’t generate the plot.
def manual_ttest(group1, group2, plot=False): # define a function that takes two groups and a boolean for whether or not to plot the results
    desc1, desc2 = group1.describe(), group2.describe() # get descriptive statistics for both samples
    n1, std1, mean1 = desc1['count'], desc1['std'], desc1['mean'] # sample size, standard deviation, and mean of the first sample
    n2, std2, mean2 = desc2['count'], desc2['std'], desc2['mean'] # sample size, standard deviation, and mean of the second sample
    # calculate the squared standard errors ('**2' is the same as squaring the number)
    se1, se2 = std1**2/n1, std2**2/n2
    # standard error on the difference between the samples
    sed = np.sqrt(se1 + se2)
    # calculate the t statistic
    t_stat = (mean1 - mean2) / sed
    # print the results
    print("Group 1: n=%.0f, mean=%.3f, std=%.3f" % (n1, mean1, std1))
    print("Group 2: n=%.0f, mean=%.3f, std=%.3f" % (n2, mean2, std2))
    print('The t-statistic is %.3f' % t_stat) # print the t-statistic
    if plot == True: # if the plot argument is set to True, plot the results
        groups = pd.DataFrame() # create an empty dataframe
        i = 1 # create a counter variable called "i" and set it equal to 1
        for group in [group1, group2]: # loop through each group in the list of groups
            plot_df = pd.DataFrame({'Values': group, 'Group': i}) # dataframe with the group's values and a "Group" column containing the group number
            groups = pd.concat([groups, plot_df]) # add this group's rows to the combined dataframe
            i += 1 # increase the counter by 1
        sns.pointplot(data=groups, x='Group', y='Values', errorbar=('ci', 95), color='black', join=False, capsize=.8) # plot the group means with 95% confidence intervals
        plt.title('Comparison of Group Means with 95% Confidence Intervals') # add a title
    return t_stat # return the t-statistic
Having defined the function, we can now call it to calculate a t-test for the difference in income between men and women.
men = sample[sample['sex']==1] # filter the sample to only include men
women = sample[sample['sex']==2] # filter the sample to only include women
t = manual_ttest(men['income'], women['income']) # run the t-test function and store the t-statistic in a variable called "t"
Group 1: n=485, mean=57.435, std=36.610
Group 2: n=515, mean=38.489, std=26.179
The t-statistic is 9.364
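As a cross-check, scipy implements this same unpooled-variance test (Welch’s t-test). ttest_ind_from_stats with equal_var=False computes the statistic directly from summary statistics, and also returns a p-value; the values below are copied from the (rounded) printed output above, so the result can differ from ours in the third decimal:

```python
from scipy.stats import ttest_ind_from_stats

# summary statistics copied from the printed output above (rounded to 3 dp)
t_check, p_value = ttest_ind_from_stats(mean1=57.435, std1=36.610, nobs1=485,
                                        mean2=38.489, std2=26.179, nobs2=515,
                                        equal_var=False)  # Welch's t-test: unpooled variances
print(round(t_check, 3), p_value)
```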
4. Make a Decision
If the t statistic indicates that the obtained difference between sample means (numerator) is substantially greater than the difference expected by chance (denominator), we reject H0 and conclude that there is a real mean difference between the two populations or treatments. Let's plot the t-statistic from our test against the critical values:
plot_z(t) # plot the test statistic on the z distribution
Based on the plot above, can we reject the null hypothesis that there is no difference in mean income between men and women?
Exercise
From the main dataframe df, draw a sample of 500 white men. Using t-tests, investigate whether there are statistically significant discrepancies in pay between white men and other groups (note: it would be best to sample 500 people in each of those groups as well). Between which groups is the pay gap most significant?
Some of this variation may be due to occupation. Compare income disparities between men and women within different occupations. Which occupation has the largest pay gap? Which has the smallest?
Research suggests that within occupational groups, collective bargaining through union membership reduces pay gaps. Read the abstract of this article, and try to replicate the analysis using our dataset.